ClusterDE

a post-clustering differential expression method

Carson Zhang

July 1, 2024

Introduction

Background

Biology

RNA

RNA carries the genetic information specific in DNA. There are two main types of RNA.

Non-coding RNA performs some biological function.

Messenger RNA forms a template for protein production (it codes for a protein which performs some biological function).

scRNA-seq

TODO: give an explanation of scRNA-seq data collection and analysis.

Double-dipping

Biologists are interested in determining the cell types of the cells in a sample.

A natural approach:

  • Cluster the sample. We hope that each cluster represents a distinct cell type.
  • For each gene, perform a hypothesis test comparing its expression levels between the two clusters.
  • Applying a multiple testing correction procedure (e.g. Benjamini-Hochberg), determine which null hypotheses to reject: which genes you will say are differentially expressed.

Why testing after clustering is problematic

However, we used the data twice.

  1. First, we clustered the cells. The algorithm chose these two specific clusters because, in some sense, these two subsets of cells look the most different in terms of their gene expression levels.

  2. Next, we test each gene to see how different the expression levels are between the two clusters. However, we already know that this variation is there: that’s how we got the clusters in the first place. So we are testing for variation that we already found through clustering.

In other words: we forced the genes to exhibit different distributions in the two clusters, so we are likely to find this forced variation instead of true variation in the dataset.

Method

ClusterDE method: 4 steps

In words, the ClusterDE method can be broken up into four steps:

  1. Generate a synthetic null dataset that mimics the structure (in particular, the gene-gene correlation structure) of the original data.

  2. Separately partition the synthetic null data and the target data (real data) into two clusters.

  3. Separately for the null and target data, perform hypothesis tests for differentially expressed genes between the two clusters. For each gene, compute some sort of difference between the p-values on the two datasets.

  4. Output a subset of the significant results from step 3 as potential cell-type marker genes.

ClusterDE

A graphical illustration of ClusterDE.

1. Synthetic null generation

The synthetic null generation consists of three steps, as described in the following figure.

Null generation steps.

1. Model the null distribution in terms of the Gaussian copula.

Sklar’s Theorem

Theorem (Sklar’s Theorem): Let \(\mathbf{X}\) be a \(m\)-dimensional random vector with joint cumulative distribution function \(F\) and marginal distribution functions \(F_j, j = 1, ..., m\). The joint CDF can be expressed as

\[ F(x_1, ..., x_m) = C(F_1(x_1), ..., F_m(x_m)) \]

with associated probability density (or mass) function

\[ f(x_1, ..., x_m) = c(F_1(x_1), ..., F_m(x_m)) f_1(x_1) ... f_m(x_m) \]

for a \(d\)-dimensional copula \(C\) with copula density \(c\).

The inverse also holds: the copula corresponding to a multivariate CDF \(F\) with marginal distribution functions \(F_j, j = 1, ..., m\) can be expressed as

\[ C(u_1, ..., u_m) = F(F_1^{-1}(u_1), ..., F_m^{-1}(u_m)) \] , and the copula density (or mass) function is

\[ c(u_1,...,u_m) = \frac{f(F_1^{-1}(u_1), ..., F_m^{-1}(u_m))} {f_1(F_1^{-1}(u_1)) ... f_m(F_m^{-1}(u_m))} \] .

1. Why Sklar’s Theorem

Sklar’s Theorem allows us to choose \(C\), the specific copula function that we will use to approximate \(F\). For convenience, we will choose \(C\) to be a multivariate Gaussian distribution, since we know how to sample from it. Therefore, we have found a way to estimate the joint MVNB distribution of genes.

\[ C(\mathbf{u}; \mathbf{R}) = \Phi_{\mathbf{R}}(\Phi^{-1}(u_1), ..., \Phi^{-1}(u_m)) \]

Now, our goal is to estimate the parameters \({\mu_j, \sigma_j}_{j = 1}^m\) and \(\mathbf{R}\).

2. Fit the null model to the real data.

Recall that the power of the copula is that it allows us to consider the marginal distributions separately from the covariance structure. Therefore, we can proceed as follows.

For each gene \(j\), estimate \(\{\mu_j, \sigma_j\}\) using maximum likelihood. These are the marginal distributions.

2. (cont.)

Separately, use the Gaussian copula to model the dependence structure.

  • Transform the raw data (counts) to the CDF values of the counts.
  • Transform the discrete CDF into a continuous \(U(0, 1)\) variable: \(U_{ij} = V_{ij} \hat{F}_j (Y_{ij}) + (1 - V_{ij}) \hat{F}_j (Y_{ij})\).
  • Transform the CDF values to standard Gaussian random variables: compute \(\Phi^{-1}(U_{ij})\) for each \(j = 1, .., m\).
  • Fit a \(m\)-dimensional multivariate Gaussian distribution to this data to compute \(\mathbf{\hat{R}}\).

3. Sample from the fitted null model.

  • Generate a sample of size \(n\) from \(N_m(\mathbf{0}, \mathbf{\hat{R}})\).

\[ \begin{bmatrix} \tilde{Z}_{11} & \dots & \tilde{Z}_{1m} \\ \vdots & \ddots & \\ \tilde{Z}_{n1} & & \tilde{Z}_{nm} \end{bmatrix} \]

  • Convert them to negative binomial count vectors.

\[ \begin{bmatrix} \hat{F}_1^{-1}(\Phi(\tilde{Z}_{11})) & \dots & \hat{F}_m^{-1}(\Phi(\tilde{Z}_{1m})) \\ \vdots & \ddots & \\ \hat{F}_1^{-1}(\Phi(\tilde{Z}_{n1})) & & \hat{F}_m^{-1}(\Phi(\tilde{Z}_{nm})) \end{bmatrix} \]

2. Clustering

ClusterDE allows any clustering algorithm. Note that it only handles the case of two clusters, so if you started out with more clusters, you should identify a particular pair of interest. In the Practical guidelines for ClusterDE usage subsection, steps 1 and 2 describe how an analyst should proceed.

  1. Given \(\geq 2\) clusters, identify 2 clusters of interest. Generally, this will be a pair for which you suspect the clustering is spurious (i.e. you think the two clusters actually come from the same cell type, so they are strong candidates to be merged into a single cluster).

  2. Filter the data so that you only consider the subset of cells that come from those two clusters.

Clustering algorithms

UMAP

UMAP is common in scRNA-seq analysis.

TODO: summarize UMAP.

Louvain

The example analyses in the presentation use the default Seurat clustering procedure, which uses the Louvain algorithm.

TODO: describe the Seurat clustering pipeline.

TODO: summarize the Louvain algorithm.

3. DE analysis (testing)

ClusterDE allows any DE test.

TODO: choose and summarize a common DE test(s), e.g. Wilcoxon.

Let \(P_1, ..., P_m\) be the p-values computed by the \(m\) DE tests on the target data. Define the target DE score \(S_j := -\log_{10}P_j\). Likewise for the synthetic null data.

TODO: motivate the formula for the DE score. It is essentially the information content of the p-value.

The final outputs of step 3: \(m\) target DE scores \(S_1, ..., S_m\); \(m\) null DE scores \(\tilde{S}_1, ..., \tilde{S}_m\).

4. FDR control

Given the target and null DE scores, compute a contrast score for gene \(j\) as \(C_j := S_j - \tilde{S}_j\).

FDR control.

We choose the minimum \(t\) to maximize our discoveries, i.e. to “use” all of the false discoveries that we can.

Results

Simulation

Real data example

Appendix